Untargeted Metabolomics for Integrative Taxonomy: Metabolomics, DNA Marker-Based Sequencing, and Phenotype Bioimaging

Integrative taxonomy is a fundamental part of biodiversity and combines traditional morphology with additional methods such as DNA sequencing or biochemistry. Here, we aim to establish untargeted metabolomics for use in chemotaxonomy. We used three thallose liverwort species Riccia glauca, R. sorocarpa, and R. warnstorfii (order Marchantiales, Ricciaceae) with Lunularia cruciata (order Marchantiales, Lunulariacea) as an outgroup. Liquid chromatography high-resolution mass-spectrometry (UPLC/ESI-QTOF-MS) with data-dependent acquisition (DDA-MS) were integrated with DNA marker-based sequencing of the trnL-trnF region and high-resolution bioimaging. Our untargeted chemotaxonomy methodology enables us to distinguish taxa based on chemophenetic markers at different levels of complexity: (1) molecules, (2) compound classes, (3) compound superclasses, and (4) molecular descriptors. For the investigated Riccia species, we identified 71 chemophenetic markers at the molecular level, a characteristic composition in 21 compound classes, and 21 molecular descriptors largely indicating electron state, presence of chemical motifs, and hydrogen bonds. Our untargeted approach revealed many chemophenetic markers at different complexity levels that can provide more mechanistic insight into phylogenetic delimitation of species within a clade than genetic-based methods coupled with traditional morphology-based information. However, analytical and bioinformatics analysis methods still need to be better integrated to link the chemophenetic information at multiple scales.


Introduction
Taxonomy is a fundamental part of biodiversity research that seeks to characterize, classify, and name biological species [1]. Integrative taxonomy combines traditional morphology-based taxonomy with additional methods such as DNA sequencing or the selection of chemophenetic markers using biochemistry [1]. While DNA sequencing has been applied to a wide range of species [2], certain groups such as bryophytic liverworts, including the herein investigated species of the genus Riccia, have been found to be very challenging to sequence, predominantly due a high abundance of glycosides, polyphenols, flavonoids, tannins, fatty acids, and other specialized metabolites that coprecipitate with Plants 2023, 12, 881 3 of 21 tive taxonomy. Thus, we present minimum reference data integrating the three domains: (a) chemotaxonomy, for the estimation of molecular chemophenetic markers using untargeted metabolomics (LC/MS-MS); (b) bioimaging, for the assessment of phenotypes and to allow for an estimation of morphological, anatomical, and phenotypic characters; and (c) DNA sequencing, for the determination of the phylogenetic relationship, which we treat as ground-truth information.

Phenotypic Analysis (Bioimaging)
The bioimaging dataset consisted of a total of 15,615 raw images, 276 segmented images, and 40 fully processed images. Figure 1 shows an overview of the images of the main phenotypic characters. taxonomy. Thus, we present minimum reference data integrating the three domains: (a) chemotaxonomy, for the estimation of molecular chemophenetic markers using untargeted metabolomics (LC/MS-MS); (b) bioimaging, for the assessment of phenotypes and to allow for an estimation of morphological, anatomical, and phenotypic characters; and (c) DNA sequencing, for the determination of the phylogenetic relationship, which we treat as ground-truth information.

Phenotypic Analysis (Bioimaging)
The bioimaging dataset consisted of a total of 15,615 raw images, 276 segmented images, and 40 fully processed images. Figure 1 shows an overview of the images of the main phenotypic characters.  To demonstrate in principle how molecular traits can be linked to the phenotype, spectra of images of the statures (Figure 1, second row from the top) were determined for the Riccia species ( Figure 2). Here, the thalli of the different species show different coloration (especially in the blue spectral components) ( Figure 2). To demonstrate in principle how molecular traits can be linked to the phenotype, spectra of images of the statures (Figure 1, second row from the top) were determined for the Riccia species ( Figure 2). Here, the thalli of the different species show different coloration (especially in the blue spectral components) ( Figure 2). An ordination using distance-based redundancy analysis (dbRDA) was performed to obtain the molecular compound classes that correlate with the spectral components ( Figure 3). The coloration of R. glauca was largely characterized by molecules of the class trifluoromethylbenzenes, whereas the other two species were characterized by specific high or low abundances in monosaccharides, specific flavonoid-glycosides, and longchain fatty acids ( Figure 3).  An ordination using distance-based redundancy analysis (dbRDA) was performed to obtain the molecular compound classes that correlate with the spectral components ( Figure 3). The coloration of R. glauca was largely characterized by molecules of the class trifluoromethylbenzenes, whereas the other two species were characterized by specific high or low abundances in monosaccharides, specific flavonoid-glycosides, and long-chain fatty acids ( Figure 3).
To demonstrate in principle how molecular traits can be linked to the phe spectra of images of the statures (Figure 1, second row from the top) were determ the Riccia species ( Figure 2). Here, the thalli of the different species show differen tion (especially in the blue spectral components) ( Figure 2). An ordination using distance-based redundancy analysis (dbRDA) was pe to obtain the molecular compound classes that correlate with the spectral com ( Figure 3). The coloration of R. glauca was largely characterized by molecules of t trifluoromethylbenzenes, whereas the other two species were characterized by high or low abundances in monosaccharides, specific flavonoid-glycosides, an chain fatty acids ( Figure 3).

Chemotaxonomic Analysis Characterizing the Riccia Species Infragenerically
Metabolite profiles of the Riccia species were investigated with three biological replications for each species using untargeted high-resolution mass-spectrometry. A total of 6010 and 3671 metabolite features were successfully quantified in positive ion mode and negative ion mode, respectively. As metabolite features include redundant information on adducts, isotopes and in-source fragments, the profiles were subjected to a second stage of analytical fragmentation (data-dependent acquisition, or DDA-MS) which resulted in 442 high-quality MS2 fragment spectra in positive and negative ion modes for peaks detected in at least 70% of samples.
To select chemophenetic markers that characterize the three Riccia species at different levels, a metabolite feature table including the abundances of the MS1 precursors in positive and negative ion modes was used to obtain markers at the molecular level. The MS2 fragmentation data was used to identify markers using SIRIUS and to classify spectra at the compound class level using CANOPUS. Molecular descriptors were calculated for the annotated spectra and a descriptor table generated by performing a matrix operation with the feature table constrained for the annotated spectra. Chemotaxonomic trees were generated to compare chemotaxonomic results at different levels (Figure 4b-e) with the phylogenetic information obtained from DNA sequencing (Figure 4a). Significant chemophenetic markers were then selected using PLS-DA and visualized using heatmaps ( Figure 5).   The MS-MS fragment spectra in the infrageneric Riccia group were also checked for known compounds from the libraries MassBank [38], LOTUS [39], and KNApSAcK [40]. Table S2 summarizes these results.

Chemotaxonomic Analysis Characterizing the Riccia Species at the Genus Level from the Outgroup
Metabolite profiles were investigated as above, resulting in a total of 7340 metabolite features that were successfully quantified in positive ion mode and 4322 features in negative ion mode. Performing DDA-MS resulted in 682 high-quality MS-MS fragment spectra in positive and negative ion modes. Classification was performed using CANOPUS and resulted in a total of 103 annotated compound classes. The occurrences of the compound classes were counted and sunburst plots were generated for the Riccia species and the outgroup species Lunularia cruciata ( Figure 6). The greatest differences were found in The MS-MS fragment spectra in the infrageneric Riccia group were also checked for known compounds from the libraries MassBank [38], LOTUS [39], and KNApSAcK [40]. Table S2 summarizes these results.

Chemotaxonomic Analysis Characterizing the Riccia Species at the Genus Level from the Outgroup
Metabolite profiles were investigated as above, resulting in a total of 7340 metabolite features that were successfully quantified in positive ion mode and 4322 features in negative ion mode. Performing DDA-MS resulted in 682 high-quality MS-MS fragment spectra in positive and negative ion modes. Classification was performed using CANOPUS and resulted in a total of 103 annotated compound classes. The occurrences of the compound classes were counted and sunburst plots were generated for the Riccia species and the outgroup species Lunularia cruciata ( Figure 6). The greatest differences were found in the compound classes of amino acids and derivatives, fatty acyls, glycosyl compounds, and benzenoids ( Figure 6).
To select chemophenetic markers that characterize the three Riccia species at the genus level from the outgroup (represented by Lunularia cruciata), a metabolite feature table, computational classification tables, and molecular descriptors were determined as above. Chemotaxonomic trees were generated to compare results at different levels (Figure 7b-e) with the phylogenetic information obtained from DNA sequencing (Figure 7a). Significant chemophenetic markers were then selected using PLS-DA and visualized using heatmaps ( Figure 8). Table S3 lists the selected chemophenetic markers separating the Riccia species from the outgroup at the molecular level.
Plants 2023, 12, x FOR PEER REVIEW 7 of 21 the compound classes of amino acids and derivatives, fatty acyls, glycosyl compounds, and benzenoids ( Figure 6). To select chemophenetic markers that characterize the three Riccia species at the genus level from the outgroup (represented by Lunularia cruciata), a metabolite feature table, computational classification tables, and molecular descriptors were determined as above. Chemotaxonomic trees were generated to compare results at different levels (Figure 7be) with the phylogenetic information obtained from DNA sequencing (Figure 7a). Significant chemophenetic markers were then selected using PLS-DA and visualized using heatmaps ( Figure 8). Table S3 lists the selected chemophenetic markers separating the Riccia species from the outgroup at the molecular level.   the compound classes of amino acids and derivatives, fatty acyls, glycosyl compounds, and benzenoids ( Figure 6). To select chemophenetic markers that characterize the three Riccia species at the genus level from the outgroup (represented by Lunularia cruciata), a metabolite feature table, computational classification tables, and molecular descriptors were determined as above. Chemotaxonomic trees were generated to compare results at different levels (Figure 7be) with the phylogenetic information obtained from DNA sequencing (Figure 7a). Significant chemophenetic markers were then selected using PLS-DA and visualized using heatmaps ( Figure 8). Table S3 lists the selected chemophenetic markers separating the Riccia species from the outgroup at the molecular level.

DNA Sequence Analysis
The plastid trnL-trnF sequence DNA dataset for six taxa included 550 aligned positions and contained new sequences of Riccia sorocarpa and R. warnstorfii, and Lunularia cruciata as an outgroup. The topology of the trees inferred by ML and BI analyses were largely identical, although their statistical supports of the ML tree were lower than of BI tree ( Figure 9). Thus, the phylogenetic position of R. glauca could not be determined of the ML tree, whereas R. warnstorfii and S. subbifurca are sister taxa (Figure 9a). On the BI tree, R. glauca is sister to R. beyrichiana and R. sorocarpa, whereas the phylogenetic positions of R. warnstorfii and R. subbifurca are unresolved (Figure 9b).

DNA Sequence Analysis
The plastid trnL-trnF sequence DNA dataset for six taxa included 550 aligned positions and contained new sequences of Riccia sorocarpa and R. warnstorfii, and Lunularia cruciata as an outgroup. The topology of the trees inferred by ML and BI analyses were largely identical, although their statistical supports of the ML tree were lower than of BI tree ( Figure 9). Thus, the phylogenetic position of R. glauca could not be determined of the ML tree, whereas R. warnstorfii and S. subbifurca are sister taxa (Figure 9a). On the BI tree, R. glauca is sister to R. beyrichiana and R. sorocarpa, whereas the phylogenetic positions of R. warnstorfii and R. subbifurca are unresolved (Figure 9b).

Discussion
In this section, we discuss the three domains from which we integrated data and discuss novel insights from and the applicability of our untargeted chemotaxonomy approach for integrative taxonomy.

DNA Sequence Data
We performed DNA sequencing of the trnL-trnF plastid region of the three Riccia species and the outgroup species L. cruciata to obtain the phylogenetic relationships, which we treated as ground-truth information (the expected result) and compared it to the chemotaxonomy information at various levels. Prior to this study, sequencing data were not available for R. warnstorfii.

Bioimaging Data
Reference bioimaging data were generated from raw microscopic images and linked to technical and expressive metadata using standardized semantics [41][42][43][44]. When extracting phenotypic traits from bioimaging data, it is possible to estimate both quantitative traits (i.e., leaf and stem area, length, width of leaves, stems and plants, specific leaf area, specific stem density) and qualitative traits (i.e., growth stature, vegetative propagule, or leaf shape and type) by combining elemental analysis with machine-learning-driven image analysis and computer vision [45,46]. Recently, it has been shown that plant biomass accumulation can be predicted from image-derived parameters alone [47], making bioimaging analysis an emerging and powerful tool for various applications in ecology [48,49].
Here, we demonstrate how to investigate spectral components of stature images to obtain differences in the coloration of the thalli of the different species and how to relate this information to the chemical components found in the tested Riccia species. Under this exemplary framework, phenotypic and chemotaxonomic data could be integrated.

Chemotaxonomic Data
Over the past 10 years, tremendous progress has been made in the technology of (untargeted) metabolomics. Using mass spectrometry, it is now possible to measure and annotate nearly all low-molecular-weight (typically <1000 Da) semi-polar compounds in organisms at once without targeting specific compounds, covering a wide range of research questions [29,50,51]. Here, we used untargeted liquid chromatography high-resolution mass spectrometry (UPLC/ESI-QTOF-MS) with data-dependent acquisition (DDA-MS) and the computational annotation tool SIRIUS [29] to annotate and classify molecules, including metabolic compounds and related metabolite families [52]. Moreover, we also determined molecular descriptors for annotated compounds and discuss their role in characterizing the individual metabolic responses of species. Prior to this study, no metabolite profiles were available for any Riccia species. Data have been deposited to MetaboLights and are available as MTBLS4668.

Discussion
In this section, we discuss the three domains from which we integrated data and discuss novel insights from and the applicability of our untargeted chemotaxonomy approach for integrative taxonomy.

DNA Sequence Data
We performed DNA sequencing of the trnL-trnF plastid region of the three Riccia species and the outgroup species L. cruciata to obtain the phylogenetic relationships, which we treated as ground-truth information (the expected result) and compared it to the chemotaxonomy information at various levels. Prior to this study, sequencing data were not available for R. warnstorfii.

Bioimaging Data
Reference bioimaging data were generated from raw microscopic images and linked to technical and expressive metadata using standardized semantics [41][42][43][44]. When extracting phenotypic traits from bioimaging data, it is possible to estimate both quantitative traits (i.e., leaf and stem area, length, width of leaves, stems and plants, specific leaf area, specific stem density) and qualitative traits (i.e., growth stature, vegetative propagule, or leaf shape and type) by combining elemental analysis with machine-learning-driven image analysis and computer vision [45,46]. Recently, it has been shown that plant biomass accumulation can be predicted from image-derived parameters alone [47], making bioimaging analysis an emerging and powerful tool for various applications in ecology [48,49].
Here, we demonstrate how to investigate spectral components of stature images to obtain differences in the coloration of the thalli of the different species and how to relate this information to the chemical components found in the tested Riccia species. Under this exemplary framework, phenotypic and chemotaxonomic data could be integrated.

Chemotaxonomic Data
Over the past 10 years, tremendous progress has been made in the technology of (untargeted) metabolomics. Using mass spectrometry, it is now possible to measure and annotate nearly all low-molecular-weight (typically <1000 Da) semi-polar compounds in organisms at once without targeting specific compounds, covering a wide range of research questions [29,50,51]. Here, we used untargeted liquid chromatography high-resolution mass spectrometry (UPLC/ESI-QTOF-MS) with data-dependent acquisition (DDA-MS) and the computational annotation tool SIRIUS [29] to annotate and classify molecules, including metabolic compounds and related metabolite families [52]. Moreover, we also determined molecular descriptors for annotated compounds and discuss their role in characterizing the individual metabolic responses of species. Prior to this study, no metabolite profiles were available for any Riccia species. Data have been deposited to MetaboLights and are available as MTBLS4668.
In order to ensure a high level of quality, we subjected the data to extensive quality control (QC) to ensure that data were recorded and annotated correctly. This was accomplished by recording properties using biological replicates and by creating expressive metadata throughout the entire data-processing procedure. Metabolomics instrument performance and detection of batch effects in the metabolomics data was realized following an established QC protocol [53]. In the MetaboLights repository (study identifier MTBLS4668) [54], we provided blank samples at the beginning and the end of each chromatographic batch run to ensure that no significant shifts in mass-to-charge (m/z) and intensities had occurred during the run. We further provided samples with standard compounds (coumarins, MeOH, methanol) to validate known ionization properties to detect shifts and other effects in retention times and m/z. The QC pipeline of our data allows for re-analysis of standardized data in the context of large-scale chemotaxonomy studies [55].

Novel Insights from Untargeted Chemotaxonomy
Our principal study revealed large infrageneric molecular differences in the investigated Riccia species. This resulted in many chemophenetic markers that can potentially be used in chemotaxonomy. The Riccia species were collected from the same field site and, despite variations in environmental conditions likely being low, the metabolite profiles of Riccia were more dissimilar to those of L. cruciata, suggesting that Riccia taxa have a slightly more divergent metabolism. These results are in line with earlier reports for the genus [6]. This fact makes chemophenetic analyses very interesting for devising phylogenetic relationships within the genus Riccia. The high level of metabolic divergence also supports the view that liverworts in general are interacting predominantly at the metabolic level [56,57] through cryptic traits that do not manifest necessarily in the phenotype [31]. Analyzing the composition of compound classes in the different Riccia taxa at the level of subclasses [58] revealed similarly large infrageneric differences. This suggests that molecular differences can be generalized and that taxa have evolved characteristic strategies that mirror their phylogenetic status.
Phytochemical investigation and compound classification confirmed the presence of many acetylenic fatty acids (as shown in Figure 3) previously reported to be very characteristic for Riccia taxa and that rarely occur in other Marchantiophyta [32]. We also confirmed the presence of flavonoids such as apigenin and luteolin glucuronides (listed in Table S2) that have previously been reported to be restricted within bryophytes to the family of Ricciaceae [33,34]. Devising chemophenetic markers also offered us insight into rates of chemical evolution and diversification [31]. As early land plants, Riccia liverworts may have evolved unique glucuronide compounds serving as protection against excessive UV sunlight as has been shown for some glucuronides in Marchantia polymorpha [59]. Further, we could not detect Riccionidin A and B, which have previously been found in the liverworts Riccia duplex, Marchantia polymorpha, and Scapania undulata [35], suggesting that Riccionidins are either restricted to a few specialized taxa or only occur in low abundance in the species, which would not be detectable by our instrumental data-dependent acquisition setup. We also found many unique flavonoid glycosides and hydroxylated flavonoids in the Riccia profiles, which is in contrast to Marchantia spp. and Lunularia spp., which contain many unique stilbenes and (neo)lignans such as bisbibenzyls [27].
The annotation of untargeted LC/MS-MS data is still a complex task, as the natural product chemistry of bryophytes is not well known, and as a result, spectral libraries such as MassBank, GNPS, WeizMass, or Lotus only contain a few reference structures [24,38,60,61]. In order to unequivocally identify compounds, either authentic standards or additional elaborate analytical methods such as NMR are necessary [27]. However, in order to devise chemophenetic markers, we find that an annotation at the class level is sufficient to characterize the distinct Riccia taxa.
Here, we found a large infrageneric chemical diversity in the tested Riccia species. Although our analytical extraction method and the data-dependent acquisition were optimized to acquire plant metabolites, endophytic fungi may have contributed specialized metabolites to the overall phytochemical profiles [6,62], similarly by exogeneous mycorrhizal fungi [63,64]. We also cannot rule out secondary colonization by microbials, as the Riccia thalli were mature, and upon spore development, thalli release spores by becoming cavernous. Our experimental setup minimized contamination from exogenous and epiphytic species such as rhizoids, and any dirt and soil residues were remove from the thalli. The diversity of the chemical profiles may also be influenced by life stage, as Riccia samples were mature, containing spores.
Our principal investigation revealed that chemophenetic markers can be interpreted at different abstraction levels, providing different resolutions. Using untargeted metabolomics, we found that analyses at the more abstracted compound class and superclass levels [58] still provide a meaningful taxonomic interpretation [55]. However, care needs to be taken, as with every abstraction, variance is harmonized and yet may lead to the biased interpretation of overrepresented signals.
Overall, the investigated taxa displayed a high dissimilarity in their profiles. The PLS-DA model was able to differentiate the taxa at the molecule and subclass level with near-perfect separation. Evaluating the chemical composition at these two levels can thus support detailed insight into phylogenetic parentage of these taxa and be considered a viable alternative when genetic or morphological methods are inconclusive.

Applicability of Untargeted Chemotaxonomy
Our untargeted metabolomics methodology allowed us to distinguish taxa based on chemophenetic markers at different levels of complexity: (1) molecules, (2) compound classes, (3) compound superclasses, and (4) molecular descriptors. We aligned the results of the clade Riccia to the outgroup species Lunularia cruciata and compared the results at different levels with the reference information obtained from DNA sequencing. Our methodology is in contrast to many other chemotaxonomic studies that usually focus on only a few predominant classes, such as phenols [16,18]. In summary, we found large infrageneric differences in the tested Riccia species that were the result of distinctly produced molecules and marked differences in the composition of numerous compound classes. In conclusion, our data allowed us to devise chemophenetic markers using untargeted metabolomics at the molecular level based on presence-absence or based on abundances in the order of magnitude. Biomarker molecules are characterized by a specific mass-to-charge ratio that corresponds to the mass of the molecule, retention time, which is specific to the mass spectrometry, and the abundance, which corresponds to the ionization within the chromatographic column [65]. Once biomarkers have been determined, they can also be detected using FT-IR or thin-layer chromatography [20,21,66]. Moreover, we introduced compound classification to chemotaxonomy that allowed us to relate the composition and constitution of compound classes to biological taxa. While in the research field of eco-metabolomics, compound classification has become a powerful tool to generalize and simplify overly complex eco-molecular functioning [67], we conclude here that in silico compound classification is also applicable to resolve taxonomic relationships and may even be better suited than analytical approaches using fluorescence, spectrophotometers [68], or certain extraction procedures [69]. In this study, we investigated the metabolite profiles of three Riccia species growing at one location. In order to generalize findings and to devise chemophenetic biomarkers at different complexity levels, we recommend investigating species at several different locations to resolve the robustness of the chemophenetic markers. More research is clearly needed to assess and compare the resolution of these methods.

Integration of Untargeted Metabolomics into Integrative Taxonomy
In this study, we showed how chemotaxonomic data using untargeted metabolomics can be integrated into integrative taxonomy, which usually involves genetic and phenotypic data. Using untargeted metabolomics, we obtained chemophenetic markers at different levels of complexity: (1) molecules, (2) compound classes, (3) compound superclasses, and (4) molecular descriptors. We generated taxonomic trees from the data at the different levels and compared these trees with the reference information of taxonomic relationships of the species obtained from genetic markers. We found that chemotaxonomy can lead to more detailed information (more chemophenetic markers that distinctly characterize the different taxa) than the information from using genetic markers alone. However, the wealth of additional chemophenetic information needs to be carefully interpreted, as molecules can also represent individual responses of species to ecological factors, which may influence the taxonomic interpretation [70]. Thus, integrating the data for use in integrative taxonomy demands an evaluation of the metabolic state of the investigated taxa and an experiment design that minimizes any environmental or ecological influence. Lastly, by extracting spectral components of images of thallus phenotypes, we demonstrate in principle how differences in the coloration of the thalli relate to the molecular components found in the tested Riccia species.

Sample Collection and Processing
Samples of Riccia glauca, R. sorocarpa, and R. warnstorfii were collected by Uwe Schwarz from an arable stubble field near Aichtal Grötzingen in Baden-Württemberg, Germany on 09/13/2021 (geographic coordinates: 48.638275 N, 9.2534083 E, elevation: 376 m, precision: 10 m). Lunularia cruciata (L.) Dumort. ex Lindb. was additionally sampled near the lab site on 12/08/2021 at 51.494848 N, 11.942323 E and chosen as the outgroup species. The specimens were brought to the lab at IPB in sterile petri dishes, where plant material was isolated, washed under a light microscope to remove dirt and other residues, filled into Eppendorf tubes, and shock-frozen. Voucher specimens were stored in the herbarium Haussknecht Jena (voucher id's: R. glauca: JE04010991, R. sorocarpa: JE04010990, R. warnstorfii: JE04010989, L. cruciata: JE04010993). For the metabolomics analysis, three biological replicates were used for each specimen. Table S4 gives an overview on samples and their use for the different types of analyses.
All new sequences were edited by eye in Sequencher 5.0 (Gene Codes Corporation, Ann Arbor, USA). The automatically performed alignment of six taxa was manually adjusted in Geneious 9.1.6 [75]. The phylogenetic reconstructions for the plastid region of trnL-trnF were conducted with Maximum Likelihood (ML) and Bayesian Inference (BI) methods. ML searches and bootstrap estimations of clade support were conducted with RAxML 8.2 [76] using RAxML BlackBox with default settings on the CIPRES Science Gateway [77]. On the same platform, the software MrBayes version 3.2.7a was used with the following parameters: rates = invgamma, ngen = 10,000,000, samplefreq = 1000 to estimate the posterior probabilities (PP) of the Bayesian analyses. The trees were visualized with FigTree 1.4.4 (https://github.com/rambaut/figtree; accessed on 12 February 2023). Sequencing data was deposited to Genbank and is available under the following accession numbers: R. sorocarpa OQ318168, R. warnstorfii OQ318167, L. cruciata OQ318169.

Phenotypic Analysis (Bioimaging)
Acquisition of images was based on the methods described in [78] and was only slightly modified for thallose liverwort species. In short, a Zeiss Axio Scope.A1 microscope was used for brightfield microscopy. For macroscopy and for preparing microscopy slides, a binocular microscope Zeiss Stemi 2000c was used. For macroscopic images, the Venus Optics Laowa 25 mm 2.5-5.0× ultra-macro for Canon EF was used. Digital images were acquired with a full-frame, high-resolution camera (Canon EOS RP, 26 megapixel).
To construct images with extended depth-of-field, images were recorded at focal planes at different z-layers. Raw images were pre-processed with Adobe Camera RAW and then exported to TIFF format while recording any image processing steps as metadata in Adobe XMP format. Multi-focus image fusion and image stitching were performed to improve the resolution of the final images Helicon Focus 8.1.1 (https://www.heliconsoft. com/heliconsoft-products/helicon-focus/; accessed on 12 February 2023) and Affinity Photo 1.10.5 (https://affinity.serif.com/en-us/photo/; accessed on 12 February 2023).
Images were manually segmented and interfering background removed using the flood select, brush selection, and freehand selection tools in the software Affinity Photo. Microscopic scales were then placed onto the segmented images using the approach described in [78].
Image features were estimated using the R package EBImage [79] by extracting the histograms of the red, green, and blue channels of the bioimages representing the visible spectra of the thalli of the different species. Distance-based ReDundancy Analyses (dbRDA) were performed using the dbrda function of the package vegan to investigate relationships of the image properties and the molecular traits [50]. Spectral values other than pure black (all RGB channels zero) and pure white (all RGB channels one) were extracted from the histogram models and used as traits in a dbRDA model. A Euclidean distance measure was used for the ordination. The dbRDA model with the largest explained variance was chosen using forward variable selection and the ordistep function. The goodness of fit statistic (squared correlation coefficient) was determined for the remaining variables by applying the envfit function on the dbRDA ordination model.
Raw camera and pre-processed imaging data were deposited to the BioImage Archive (BioStudies) [41] and are available under the identifier S-BIAD443 (https://www.ebi.ac.uk/ biostudies/studies/S-BIAD443; accessed on 12 February 2023). Processed images and metadata were deposited to the Image Data Resource under accession number idr0137 [80,81].

Metabolite Extraction and Untargeted Mass-Spectrometry
We followed extraction procedures for LC/MS originally developed for vascular plants by [82] and modified slightly for bryophytes [53]. This method has been shown to provide robust results for the compound classes we studied [83]. A detailed description of the protocol and methods can be found in [84]. In brief, frozen plants were homogenized in a ball mill at 25 Hz for 50 s and extracted with 1 mL of 80:20 MeOH:H 2 O. Samples were shaken at room temperature for 15 min at 1000 rpm, then sonicated for 15 min and shaken again for 15 min at 1000 rpm. Samples were centrifuged for 15 min at 13,000 rpm for 15 min; 750 µL of the supernatant was collected and concentrated in vacuo. Then, they were reconstituted to 10 mg fresh weight/100 µL with 80:20 MeOH:H2O and injected into a Bruker Elite HPLC equipped with a Nucleodur X18 Gravity-SB column (1.8 µm 100 × 2 Macherey Nagel, Dueren, Germany) and coupled to a Bruker TIMS-TOF (timsTOF Pro, Bruker, Bremen, Germany). Separate injections were performed for the positive and negative mode.
Data-dependent acquisition (DDA-MS) mode was used with the instrument settings described in [84]. Due to different injection order, the second to fourth samples of R. warnstorfii were used for the analysis.

Raw Data and MS1 Data Processing
Raw data converted into mzML format using msconvert [85] as well as derived data (SIRIUS project folders, RData) were deposited in MetaboLights under the study identifier MTBLS4668 [54]. Metadata were recorded in compliance with the minimum information guidelines for Metabolomics studies [86].
The MS1-level peak tables were created separately for positive and negative ion modes with the settings featureValues, method = "medret", value = "into". The peak tables were log-transformed, and missing values were imputed with zeros. Histograms and PCA diagnostic plots were generated to additionally evaluate the distribution of the data. MS2-level fragment spectra (MS/MS spectra) that were acquired by the Data-Dependent Acquisition mode (DDA-MS) were extracted from the profiles using the chromPeakSpectra, msLevel = 2 L, return.type = "Spectra" settings of XCMS. Spectra obtained from the same precursor ion were combined using the combineSpectra function from the R package Spectra using the following settings: FUN = combinePeaks, ppm = 35, peaks = "union", minProp = 0.8, intensityFun = median, mzFun = median, backend = MsBackendDataFrame. This step was performed separately for positive and negative ion modes. The MS1-level peak tables were then filtered to include only peaks for which the DDA-MS had acquired MS/MS fragment spectra. The spectra were saved in MSP and MGF files for further data processing.
As standard variance and median values were within 10% deviations, the filtered MS1-level peak tables containing log-transformed abundances of peaks in positive and negative ion modes were joined and used for further statistical analyses. Presence/absence peak tables were also generated to contain whether a metabolite feature was detected in the profiles. Features with abundances less than 10 −8 % of the median abundance were considered absent.

Processing of MS/MS Data
Identification of MS/MS fragment spectra was carried out using the software SIR-IUS version 5.6 [29]. The following settings were used for both ionizations: IsotopeSettings.filter = true, FormulaSearchDB, Timeout.secondsPerTree = 0, FormulaSettings. Identification was accomplished automatically by selecting the highest-ranking candidate for each spectrum. If the software could provide a COSMIC score [30], the candidate with the highest-ranking COSMIC score was selected. The corresponding SMILES and the compound classes provided by the CANOPUS [88] were extracted and stored for each spectrum. In addition, biomarkers were manually identified, and the most likely library match for bryophytes or plants was manually curated.
The classification provided by CANOPUS for each MS/MS fragment spectrum was aggregated and stored in a separate classification table. Compound classes were analyzed at the ChemOnt level of subclasses and superclasses. The classes were aggregated and counted for each spectrum found in a sample and multiplied by the peak abundances of the corresponding MS1 precursors.
The SMILES provided by the SIRIUS software for the MS/MS fragment spectra were saved to a text file and molecular descriptors were calculated using RDKit and its Python module [89]. The RDKit results were saved in a csv file, which in turn was analyzed in R. A data table was constructed corresponding to the feature table by performing a matrix operation of both tables. This data table was used for performing statistical analyses (see below).

Chemodiversity Analyses
To assess the overall chemical diversity, the richness was first determined representing the number of features, compounds, classes, or descriptors found in a sample, respectively. Second, the number of unique variables was determined that represents those variables that are present in one species but not the others. As a third diversity measure, the Shannon diversity index H' was determined according to [90]. Finally, the Pielou's evenness J that describes the homogeneity of the distribution of the intensity or abundance of compounds present in a species was determined according to [90]. To assess significant differences among the groups, ANOVA with post-hoc Tukey honestly significant difference (HSD) test was calculated, and the R packages vegan, multcomp, and multtest were used.
To get an overview on the chemical diversity of compound classes and their diversity among or across species, sunburst plots were constructed. They were implemented as a custom function [91] comprised as stacked barplots from the inside out, starting with organic compounds in the center. The classes further to the outside represented the more specialized classes. The classes were arranged at different levels based on the CHEMONT ontology [58].

Explorative and Unsupervised Multivariate Analyses
To discriminate species based on chemophenetic markers at different levels, principal components analysis (PCA) was performed using the prcomp function in R. In order to assess the influence of different study factors, variation partitioning was performed using the function varpart in the package vegan.

Selection of Chemophenetic Molecular Features
Chemophenetic markers were selected at the levels of MS1 features ("feature list"), MS1 features constrained to the availability of MS2 spectra ("compound list"), at the compound class and superclass levels ("class list", "superclass list"), and at the level of molecular descriptors ("descriptor list"). Variable selection was accomplished with the Partial Least Squares Discriminant Analysis (PLS-DA) using the caret package. A prediction model was trained using the train function from the caret package, and variable importance values were extracted from the model using the varImp function. Variables were selected (were considered significant) when their quantile threshold was above 0.995. In order to visualize significant relationships of the chemophenetic markers at the different levels, heatmaps were generated (using the gplots R package) from the selected variables.
To evaluate the performance of the fitted models, 10-fold cross-validation was performed (package mltest), and the Receiver Operating Characteristic (ROC) and PR (Precision and Recall) curves using the functions plot.roc and ci.se from the pROC package and the function pr.curve from the PRROC package were additionally constructed [92][93][94][95]. The R-squared of the fitted vs. the entire model and the area under curve (AUC) were calculated from the ROC, and the area under precision recall curve (AUC-PR) was determined from the PR curve.

Construction of Taxonomic Trees
Taxonomic trees were constructed by first calculating a distance matrix on the feature tables using the Euclidean distance measure, and then, clustering was performed using the complete method. The following R packages were used: ape, pvclust, dendextend, phangorn, Hmisc, gplots.

Deposition of Metabolomics Data
Raw metabolite profiles and the annotated feature tables were deposited in the Metabo-Lights repository (study identifier MTBLS4668) [54], along with QC samples consisting of blanks that were acquired at the beginning and at the end of each chromatographic batch run and samples containing standard compounds (coumarins, MeOH, methanol). Code to reproduce the results is available in GitHub (https://github.com/ipb-halle/iESTIMATE; accessed on 12 February 2023) [96].

Conclusions
Data on phenotypic, phylogenetic, and molecular traits of bryophytes are scarce but are needed to understand the individual responses of bryophytes with regard to characterizing, classifying, and naming species [1,97]. Integrating data that span multiple spatiotemporal scales, such as phenotypes, molecules, or DNA sequences is a key concept in integrative biodiversity research and will allow further linking of molecular processes to taxonomy and association of specific mechanistic characters of the species with their ecology and evolution [98]. In order to promote data re-use, we followed the FAIR principles, associated the datasets with rich metadata, and provide computational code to semi-automatically (re-)process the data [99,100]. Integrative taxonomy typically combines an assessment of phenotypes and DNA sequence markers to elucidate phylogenetic relationships of species [22,98]. In this study, we integrated untargeted metabolomics with DNA sequencing and phenotypic bioimages and show in principle how this integration allows for a more detailed taxonomic evaluation of the genus Riccia. We also showed how chemophenetic data allows for more realistic species circumscriptions [13]. The integrative data also allows investigation of the ecology and evolution of the species and can shed light on their origin and biogeographic history. Additionally, the integrative data will advance many related research areas such as functional ecology by investigating molecular traits [55], aiding global biodiversity synthesis efforts at various scales [101,102], and making connections between high-throughput biodiversity inventories to "classic" bryology, (digital) "collectomics" ("digitomics"), or data science [103]. The data may also be used in bioinformatics to train machine-learning models that may advance automated high-throughput analyses and pattern recognition [49].

Supplementary Materials:
The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/plants12040881/s1: Table S1: 71 chemophenetic marker molecules selected by PLS-DA characterizing the investigated Riccia species. Columns include the internal identifier used by the XCMS peak detection software, the mass-to-charge ratio of the precursor ion (m/z), retention time (RT) (s), compound name, the most specific compound class, the SMILES, and the level of annotation confidence (MSI level) according to [104]; Table S2: Compounds of interest. Table containing known compounds that were previously described in the literature to be characteristic for Riccia species; Table S3: Chemophenetic biomarker molecules selected by PLS-DA representative of each of the investigated Riccia species. Columns include the internal identifier used by the XCMS peak detection software, the mass-to-charge ratio of the precursor ion (m/z), retention time (RT) (s), compound name, the most specific compound class, the SMILES, and the level of annotation confidence (MSI level) according to [104]; Table S4: List of samples and their identification codes for use with the different types of analyses. Data Availability Statement: Raw camera and pre-processed imaging data were deposited to the BioImage Archive (BioStudies) and are available under the identifier S-BIAD443 (https://www.ebi. ac.uk/biostudies/studies/S-BIAD443). Processed images and metadata were deposited to the Image Data Resource under accession number idr0137 (https://doi.org/10.17867/10000185). Sequencing data were deposited to Genbank and are available under the following accession numbers: R. sorocarpa OQ318168, R. warnstorfii OQ318167, L. cruciata OQ318169. Raw metabolite profiles and the annotated feature tables were deposited to the MetaboLights repository under the study identifier MTBLS4668 (https://www.ebi.ac.uk/metabolights/MTBLS4668; accessed on 12 February 2023). Data analysis plots, images, and R code to reproduce the plots are available in Zenodo (https://doi.org/10.5281/ zenodo.7638304). Source code is available on GitHub (https://doi.org/10.5281/zenodo.7615220).

Conflicts of Interest:
The authors declare no conflict of interest.